Skip to content

Conversation

JonasKunz
Copy link
Contributor

@JonasKunz JonasKunz commented Oct 17, 2025

Part of #135625 , follow up of #136075.

Makes ExponentialHistogramState mimic the TDigestState provided functionality used by aggregations, so that we can build the drop-in replacement HistogramState consisting of both. This will then allow us to apply e.g. percentile aggregation on exponential histograms in addition to T-Digests and a mix of both.

We also had to implement a centroids() functionality, which is implemented by returning the mean values of the populated histogram buckets. Based on my research, centroids() is only used in the boxplot aggregation in order to define the length of the whiskers, where this usage should be fine.

@elasticsearchmachine elasticsearchmachine added external-contributor Pull request authored by a developer outside the Elasticsearch team v9.3.0 labels Oct 17, 2025
@JonasKunz JonasKunz marked this pull request as ready for review October 17, 2025 14:22
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Oct 17, 2025
public double cdf(double x) {
ExponentialHistogram histogram = histogram();
long numValuesLess = ExponentialHistogramQuantile.estimateRank(histogram, x, false);
long numValuesLessOrEqual = ExponentialHistogramQuantile.estimateRank(histogram, x, true);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noticed, these don't check if the value is outside min/max? You should do that separately, return 0 for < min and 1 for > max.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correctness wise, we already do this here.

So you are suggesting to do this for performance reasons?
I'm not sure if this is worth it, because it increases the complexity: We currently clamp the bucket POLRE to the histogram min / max. So if the POLRE was clamped, we need to return different values based on whether the requested rank was inclusive or exclusive equal values.

I don't think that this is a common enough case to justify this extra code

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does this increase the complexity? Isn't this a simple condition to return early, that can be added at the top of ExponentialHistogramQuantile.estimateRank? It should be uncontroversial, in terms of correctness, and more efficient.

Btw, this is unrelated to this pr so not a blocker.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

:Analytics/Aggregations Aggregations external-contributor Pull request authored by a developer outside the Elasticsearch team >non-issue Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) v9.3.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants